eal-time data compression isa hot topic at 
the moment, with the likes of Stacker 
being used to squeeze a quart of data into 
a pint disk drive. So how does the magic 
work? 

Ifyou already know something about 
coding theory, you might have heard of 
Huffman coding and statistical modelling of data, 
and you might expect the data compression 


*™nethods currently causing a revolution to be 


rooted in some sophisticated variation on these 
methods. But while it’s true that these traditional 
methods are capable of excellent compression 
performance, they’re very slow to implement 
and therefore not really suitable for real-time 
use. So a different approach is required. 

The breakthrough came in 1977 with a theo- 
retical paper by Jacob Ziv and Abraham Lempel. 
This paper started the whole subject of diction- 
ary compression. Although it was a very theo- 
retical-sounding paper, it’s amazing no-one had 
thought of the technique before. It’s even possi- 
ble that someone might actu- 
ally have implemented it in a 
practical form before this. 

Consider the text of this 
article as a subject for com- 
pression. Instead of giving 
you each word spelled out 
letter by letter, I could give 
you the page number and line 
number of each word in a 
standard dictionary. You can 
easily imagine, and a back-of-the-envelope calcu- 
lation will quickly confirm, youreduce theamount 
of data needed to store the text if you use diction- 
ary references. This is the basis of dictionary 
compression. 

But there’s a hidden cost. You can refine the 
dictionary used to a minimum — for example, if 
you used the Shorter Oxford to encode this article 
there would be a lot of words that were never 
referred to and could be omitted. But the fact 
remains that to make sense of the data, you need 
the dictionary — so the storage needed has to be 
taken into account. You may be able to store this 
article in only a few kilobytes using dictionary 
coding, but you’d still need several megabytes to 
store the dictionary required to decode it. 

Clearly, the tantalising problem of dictionary 
coding is howto construct the dictionary without 
incurring large storage overheads. This is where 
‘sliding window’ dictionary compression comes 
in. Why not use the document as its own diction- 
ary? Before you begin to worry what on earth I’m 
on about, it’s time to be more precise. 


z 


Master class 


Hitting the buffers 

Start off with a buffer that acts as a window onto 
the file being compressed. Divide the buffer into 
two portions (see Figure 1) — the chunk to the left 
used as the dictionary and the chunk to the right 
used as a look-ahead buffer that holds the section 
of the file we’re trying to compress. What hap- 
pens now is that you try to find a match in the 
dictionary section for the string in the look-ahead 
buffer. Once you find a match, the string in the 
look-ahead is coded by generating a three-ele- 
ment token: 


Match position,match length,next character 


So, for example, the token 32,14,d means 
that the phrase in the look-ahead buffer matches 
the dictionary at position 32 and the match con- 
tinues for 14 characters. The first character in the 
look-ahead after the match fails is ‘d’. This token 
can be decoded back to the phrase it represents 
simply by getting the 14 characters starting at 


position 32 from 
the dictionary 
and adding a letter ‘d’ onto the end. Thus we’ve 
succeeded in reducing a phrase of 15 characters 
to a three-element token, and this must take less 
space to store. Notice that as the dictionary is just 
the first part of the file read into the window, 
there’s no need to store a separate dictionary — 
the file is its own dictionary! 

You're probably already thinking of reasons 
to criticise the idea of using the start ofa file as a 
dictionary forthe rest, but hang on—there’s more 
to the method. Looking up the look-ahead buffer 
in the dictionary is simple enough, but there’s 
one subtlety involved. You don’t want to code the 
phrase in the look-ahead buffer as the first match 
in the dictionary but as the longest match. Why? 
Well, no matter what sort of match you get, the 
token still consists of three elements. So obvi- 
ously you'll do better to code as many characters 
per token as possible. This makes the first part of 
the algorithm: 


1. Find the longest match between the phrase in 
the look-ahead buffer and the dictionary, and 
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If you want to squash 
something, use a dictionary. 
And that goes for data as 
well as wasps. Mike James 
explains 


code this as a three-element token P%,N%,C$ 
where P% is the position of the match, N% the 
length and C$ the character following the match- 
ing phrase. 


On the slide 

Now what do you do? You've successfully coded 
N%+1 and so finished with N%+1 characters. 
This is where the ‘sliding’ part of the sliding 
windows algorithm comes in. Having coded 


N%+1 characters, why notjust slide the window 
N%+1 characters to the right to read in N%+1 
more characters at the left? (See Figure 2.) The 
+1 characters that were in the look-ahead are 

w in the dictionary. The look-ahead buffer now 
contains nothing but characters that have yet to 
be coded, so we can repeat the step of finding the 
best match and issuing a token. But wait — what 
about the N%+1 characters that fell off the end 
ofthe dictionary buffer when the window moved? 
The answer is that if the window slides over the 
entire file so that data enters at the left and is 
coded as is moves from the look-ahead to the 
dictionary, the data that’s lost out of the end ofthe 
window will always have been coded at some 
time in the past. So the second step is: 


2. Shift the window N%+1 characters to the right 
and repeat step 1. 


That’s all there is to it. You don’t need to store 
a separate dictionary, because the file in both its 
compressed and its uncompressed form is the 
dictionary. This is the method all of the super- 
quick and powerful data compression methods 
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use — PKZip, Stacker and even the QIC-122 tape 
compression standard! Needless to say, they all 
throw in their own added extras to make the 
process more efficient, but sliding window com- 
pression is at the bottom of them all. 

This form of sliding window compression is 
generally called LZ77 after its inventors. It’s 
remarkably good at file compression, because 
the dictionary changes to suit the section of the 
file currently being processed. The idea is that 
most files follow a principle of ‘local similarity’. In 
other words, if a symbol or phrase has just 
occurred it’svery likely to occur again. Ofcourse, 
the efficiency of the whole scheme depends on 
the size of the window. 


Getting started 
There are plenty of practical details that I’ve 
glossed over in this description, but these are 
the kind of thing a programmer can easily notice 
and solve. The main question that must be wor- 
rying you concerns how the whole coding proc- 
ess gets going. When the window first slides 
over a data file, there’s nothing in the dictionary 
~~~ data slides in until it fills the look-ahead buffer, 
and only then passes into the dictionary. Up to 
this point the contents of the look-ahead have to 


tionary ready to help with the decoding of the 
next token. 


Do it yourself 

The best way to make sure you understand is to 
do — so let’s write a compression program. This 
is a demo and not intended to compete with the 
real thing. To do a good job on sliding windows 
compression you have to pay attention to detail, 
which makes the task a much bigger one and also 
obscures the principles. Subroutine 
Compress(F1,F2), written in QBasic, will com- 
press the file opened as F1 into the file opened as 
F2. Both files have to be opened for binary. 


SUB Compress (F1, F2)_! 
CALL InitLoad(F1) 
DO CALL LongMatch(P%, N%)_! 
CALL IssueT oken(P%, N%, F2).J 
CALL ShiftWindow(N%) J 
CALL MoreData(F1,N%) | 
IF C$(W% —L% + 1) ="" THEN EXITDO JJ 
LOOP! 
END IF! 
END SUB_! 


Notice that W% is the length of the window, 


Coded 


Characters ‘drop off’ as window slides left 


/ | 


“~@ The window shifts to the left to move the coded portion of the text from the look-ahead buffer into the 


dictionary 


be coded while the dictionary is empty! The 
simple solution is to use a token of the form 
0,0,C$ to code any characters that don’t match 
anything in the dictionary. You should be able to 
see that the first portion of any file will be codes 
as a stream of 0,0,C$ tokens which become less 
as the dictionary fills. 

Decoding also starts out with an empty dic- 
tionary. The first token must be a 0,0,C$ type, 
and this is used to place a character in the look- 
ahead buffer—which would now be betternamed 
an expansion buffer, as each token is read from 
the file and expanded into the look-ahead. Note 
that the expansion is done so that the phrase is 
inserted atthe boundary between the dictionary 
and the look-ahead. The expansion works by 
taking the N% characters starting at P% from 
the dictionary and then adding the character in 
the token to the end. Thus after each token is 
expanded the look-ahead contains N%+1 new 
characters. These are then shifted into the dic- 


lation copies the files 


bait 


L% is the length of the look-ahead, and C$(W%) 
is the full window buffer. All of these are global 
variables. This means that the dictionary part of 
the window is C$(1) to C$(W%-L%) and the look- 
ahead is C$(W%-L%+1) to C$(W%). 

Subroutine InitLoad(F1) simply reads enough 
data from the input file to fill the look-ahead 
window ready for compression to start: 


SUB InitLoad (F1).) 
CHAR$ = "J 
FOR i% = 110 L%.! 
GET #F1, , CHARS J 
C$(W% —L% + 1%) = CHARS J 
NEXT i%_! 
END SUB! 


Subroutine LongMatch is just a boring old ‘find 
the biggest by exhaustive search’ routine: 


SUB LongMatch (P%, N%) 
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P% = 0.) 

N% = 0.) 

FOR j% = 1T0W%-L%J 
CALL Match(j%, thisN%) | 
IF thisN% > N% THEN <I 

N%=thisN% 
PH=j% I 
ENDIF JJ 
NEXT j%.! 
END SUB! 


The loop calls Match(j%,thisN%) to compare 
the contents of the look-ahead with the diction- 
ary starting at position j% and return the number 
of characters that match. Subroutine Match is 
just: 


SUB Match (j%, thisN%) | 
i% =W%-L% + 15) 
thisN% = 0.) 
D0.) 
IF C$(j% + thisN%) <> C$(i% + thisN%) THEN 
EXIT DO! 
thisN% = thisN% +1 JJ 
LOOP UNTIL (i% + thisN%) > (W% — 1)! 
END SUB! 


This routine is 
slightly more compli- 
cated than you might 
think. The reasonis that 
the dictionary match 
can continue into the 
look-ahead buffer. This 
may seemvery strange, 
butitworks OK because 
when you decode the 
token the look-ahead 
buffer is being filled 
faster than it’s being 
used to decode — try it 
and see. However, it’s 
importantnotto use the 
very last character in 
the look-ahead, as this 
has to be issued as part 
ofthe token! An alterna- 
tive would be to allow 
some sort of null char- 
acter to be used in the 
token, but this is more complicated still. 

~ Now that we have the best dictionary match 
in P% and N% it’s time to make up the token. This 
is the messy bit that I’d rather not deal with! The 
reason is that it will vary greatly depending on 
the details of the language used and on the size of 
the buffer. For simplicity I'll assume that the 
length of the buffer and the look-ahead is such 
that both P% and N% can be coded as in a single 
two-byte word: 


The installation 


SUB IssueToken (P%, N%, F2)_! 

DIM TOKEN AS STRING * 3.) 

Word% = P% * 256 + N%_| 

TOKEN = MKI$(Word%) + C$(W% —L% + N% + 1)! 
PUT #F2, , TOKEN.| 

END SUB! 


Notice the use of MKI$, which converts a 
two-byte integer into an equivalent two-byte string 
suitable for writing out to a binary file. 
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The next’ Step int: Rorribly inefficient 
ShiftWindow routine, which actually does move 
the data in the window N%+ 1 places to the right: 


SUB ShiftWindow (N%) _| 

FOR i% = 1T0W%-N%-1.) 
C$(i%) = C$(I% +N% +1) JJ 

NEXT i%_! 

END SUB! 


And finally, MoreData fills the look ahead buffer 
with N%+1 more characters to top it up after the 
shift: 


SUB MoreData (F1, N%) | 
CHAR$ =" Bi 
FOR i% =1T0N% +12) 
GET #F1,, CHARS <I 
IF EOF(F1) THEN CHAR$ =" J 
C$(W% —N%—1 + i%) = CHARS <I 
NEXT i%.1 
END SUBJ 


Notice that there is some very tricky index 
work going on in the loop - I’m sure there’s a 
better way of doing this. Also notice the need to 
feed null strings into the buffer if the file runs out 
before the refill. This is the only situation in 
which a null string can get into the window, so 
this is used by the coding loop as an indictor that 
the job is over. 

This brings us to yet another subtle point. 
Notice that the coding loop keeps on going until 
a null string hits the boundary between the dic- 
tionary and the look-ahead. This certainly means 
that all of the file has been coded, but it can also 
result in one extra character, a null string, being 
coded. It only happens in one situation but can 
you see what it is? I decided that as this was only 
a demo, adding a null string to a file wasn’t a 
terrible sin — so I left it. 


Decompression 
Obviously I can’t leave you with just a compres- 
sion subroutine — after all, how would you test it? 
Decomp(F1,F2) will decompress the file opened 
as F1 into the file opened as F2. Again, C$, W% 
and L% are global. 


SUB Decomp (F1, F2)_! 

DO.) 
CALL GetT oken(P%, N%, CHAR$, F1)_! 

IF EOF(F1) THEN EXIT DO_J 
CALL ExpandT oken(P%, N%, CHAR$, F2)_J 
CALL ShiftWindow(N%) J 

LOOP! 

END SUB! 


Decomp is much simpler than the compres- 
sion routine, but it still has a number of tricks up 
its sleeve. GetToken is simple enough: 


SUB GetToken (P%, N%, CHAR$, F1)_! 
DIM TOKEN AS STRING * 3.) 

IF EOF(F1) THEN EXIT SUB_| 

GET #F1, , TOKEN! 

word% = CVI(MID$(TOKEN, 1, 2)) J 
N% = word% MOD 256_! 

P% = word% \ 256_| 

CHAR$ = MID$(TOKEN, 3, 1) 1 

END SUB_| 
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Notice that this delivers the token fully de- 
coded into P%, N% and CHAR$. 

ExpandToken does exactly whatits name sug- 
gests: expands the current token into the look- 
ahead. Almost as a by-product of this expansion, 
it also copies the data, as it’s produced, into the 
output file: 


SUB ExpandToken (P%, N%, CHAR$, F2) | 

FOR i% = 1 TON%_| 
C$(W% —L% + 1%) = C$(P% + 1% —1) ed 
PUT #F2, , C$(P% + i%-1) JJ 

NEXT i%_] 

C$(W% — L% + N% + 1) = CHARS. 

PUT #F2, , CHAR$_| 

END SUB_| 


Finally, all that’s left is to shift the N%+1 new 
characters in the look-ahead into the dictionary: 


SUB ShiftWindow (N%) <I! 
FOR i% =1TOW%—-N%-1.) 
C$(i%) = C$(I% +N% +1) JI 


NEXTi% J 
FOR i% = W%-N%T OW%) 
C$(i%) =-_™ sl 
NEXTi% J 
END SUB! 


Most of the tokens in the coded file are 0,0 or 
P%,1 tokens, and this means using three bytes to 
represent a character that took a single byte to 
represent in the original file. You have to do 
better then this when the compression isn’t work- 
ing too well to say ahead. In most cases this 
means using a single-bit flag to indicate that a 
character is an uncoded 0,0 token. You might 
even not bother with P%,1 tokens and prefer to 
pass the uncoded characters along. You also 
need to pack the P%, N%, C$ tokens into as few 
bits as possible when you do decide it’s worth 
issuing a token. 

Perhaps we can ignore such refinements and 
gain a worthwhile compression by increasing 
the window size? We can — a 128-byte window 
does produce reasonable gains — but increase it 
to, say, 4K and you hit the second snag: speed. 
The time it takes to search the dictionary for the 
largest match increases with the square of the 
size of the windows. Add to this the crazy idea of 
actually shifting the data each time, and you have 
a good example of how not to implement an 
algorithm. 

To make it all run at a reasonable speed yor 
have to do two things. The first is to use pointe: 
to the data so that you can shift the window by 
adjusting the pointers rather than by actually 
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Watch the big match while you can — next year it'll probably be on Sky 


Although the second FOR loop that nulls the 
remainder of the look-ahead buffer isn’t really 
needed, it helps make any displays of the win- 
dows you may add by removing old data. 


No small achievement 
You should be able to make sense of the sliding 
window algorithm from the descriptions and the 
routines given here. For my own benefit I added 
a rolling display to the compress and expand 
subroutines (see screenshot) so thatthey showed 
the state of the window, including pointers to the 
match position and length—-this enabled me to sit 
back and watch how the compression was going 
and how effective the dictionary was. It’s tempt- 
ing to cheer when a token like 32, 15,c goes past. 
So how well does it do? If you try a modest 
window size, say W%=64 and L%=16, you’re in 
for a shock. The compression routine actually 
increases the size of the file! The reason is that 
we've been extravagant in the coding of tokens. 


moving it. This increases efficiency, but makes 
the algorithm more difficult to follow. The sec- 
ondis to keep the dictionary in a form that makes 
it easier to search to find the largest match. This 
implies some sort of tree structure, which is 
indeed exactly what you need. 

You can tell there’s a lot of work to be done in 
turning these demonstration routines into a real 
compression/decompression pair. Fortunately 
there’s a simpler solution: get hold of a copy of 
PKZip and Unzip and, if you’re serious, the com- 
pression library that goes with them. There are 
other forms of sliding window compression that 
work better in some situations — LZ78, for exam- 
ple. Unfortunately some of these have been pat- 
ented, not just copyrighted. In particular LZW, a 
variant of LZ78, is patented by Unisys. Personally 
I think thisisa crazy situation, but be warned that 
if you improve on the methods described above 
you may need to check that you haven’t acciden- 
tally infringed a patent. m 


